Rationale We did QTL mapping in an F2 mapping population to discover the loci underlying the ovule mutant in M. nudatus GMR2. The F2 population was created by first crossing DHRO22 x GMR2, both at least 6 generations inbred, then selfing the F1. Phenotyping of F2s revealed that the ovule mutant appears to segregate at a 25-50% rate, suggesting a single locus.

An Rmarkdown file (Ovules.Rmd) was created to document analyses. Work was performed in R version 4.0.5 (2021-03-31).

A conda virtual environment (“f1mapping”) was employed for most programs used in this project, with the exception of some scripts to de-aggregate F2s in ddRAD libraries, which used python 2.7.

Note: to execute this file, first activate the conda virtual environment “f1mapping”. Here is the list of installation commands used to create this environment:

conda create -n f1mapping python=3.8 conda install -c conda-forge r-base=4.0.5 conda install -c conda-forge r-knitr=1.33 Rmarkdown and reticulate were installed within R using install.packages()

Cross details

The cross is as follows: DHRO22 x GMR2 –> F1. Selfed F1 for 88 F2s. Both DHRO22 and GMR2 are at least 6 generations inbred.

The total number of sequenced individuals per backcross mapping populations is 88. There were two libraries prepared. One had 48 barcoded individuals and the other had 43. 3 samples in the latter library were from a different mapping project and were added to fill out the plate. Plate layout is found in the file ovule_plate_layout.txt.

Processing of ddRAD reads

Thom Nelson’s preliminary scripts are written for Python 2.7.18. I have installed a virtual environment to work with Python 2. To get there type:

conda activate py2

plates <- read.table('ovule_plate_layout.txt',h=F,sep='\t')
head(plates)

Read files copied into the correct directory then renamed.

cp ~/data/SequencingRun_May2022/ddRADPlates_2022/JW77_*gz /work/01_mapping/03_ovules/
cp ~/data/SequencingRun_May2022/ddRADPlates_2022/JW78_*gz /work/01_mapping/03_ovules/
#make list of library sequencing files
ls -1 JW*gz > sequencing_files.txt
#take second column from library layout
cut -f2 ovule_plate_layout.txt > library_names.txt
#produce new file copying each line from library_names.txt 4x to make new file
perl -ne 'print "$_" x4' library_names.txt > library_names_4x
#strip first two "_" to get file type from first column of sequencing_files.txt
cut -d'_' -f4 sequencing_files.txt > file_type
sed -i 's/I1/.i7/g' file_type 
sed -i 's/R1/.1/g' file_type 
sed -i 's/R2/.i5/g' file_type 
sed -i 's/R3/.2/g' file_type 
paste library_names_4x file_type > test
#replace tab with ""
sed -i 's/\t//g' test
#add .fq.gz to each in test
sed -e 's/$/.fq.gz/' -i test
#paste columns sequencing_files.txt library_names_4x test > test2
paste sequencing_files.txt test > mvcommand
#prefix mvfile with mv command
sed -i 's/JW/mv JW/g' mvcommand
#replace tab with space (because it's ugly not because it matters)
sed -i 's/\t/ /g' mvcommand
echo '#!/bin/bash' >> mv.sh
echo '#' >> mv.sh
echo '#$ -S /bin/bash' >> mv.sh 
cat mv.sh mvcommand > mvscript.sh
chmod 755 mvscript.sh
mv dxg*gz 01_rawreads
#./mvscript.sh

Execute the resulting scripts in py2 virtual environment

./rmdup.sh ovule_library_filenames.txt /work/01_mapping/03_ovules/01_rawreads/
for file in rmdup_*.sh; do sbatch $file; done

Execute the resulting scripts in py2 virtual environment

Flipreads.

Next use flip2BeRAD.py to flip the reads for each half-plate. From Thom Nelson: “For the bestRAD protocol, the genomic DNA fragment (including the inline barcode) gets randomly incorporated into the i5 and i7 adaptors, but the downstream steps require the first/forward read to contain the barcode and rad tag.”

“The output file will be called filtered_forward.fastq and filtered_reverse.fastq, so now is a good time to make separate folders for each half-plate and put the resultng files in those folders.”

The barcode list is in the file ddrad_barcode_seqs.txt. You can see it here:

barcode_seqs <- read.table('ddrad_barcode_seqs.txt',sep='\t',h=F)
head(barcode_seqs)

Number of mismatches for barcodes is set to 0. This insures that reads will be strictly from the individual labeled with that barcode. Setting this higher risks including errant reads and causing errors in genotyping.

Try with offset bases of 0 first. (“The number of basepairs to offset from the 5’ end of the read when searching for barcodes”)

./flip.sh ovule_library_filenames.txt /work/01_mapping/03_ovules/02_flipreads/
for file in *.flip.sh; do sbatch $file; done

Trim reads with trimmomatic v 0.39. Pull out barcodes using STACKS process_radtags http://catchenlab.life.illinois.edu/stacks/comp/process_radtags.php

Make directory for alignments then move deaggregated read files to alignment folder.

Bash script for writing alignment files for each F2. First trim reads, align to TOL v5 genome, add read groups. Collect stats on alignment for each F2.

./trim_parents.sh ovule_genome_sequence_files.txt /work/00_parent_reads/00_raw.reads/ /work/00_parent_reads/00_clean.reads /work/01_mapping/03_ovules/03_align

./align_parents.sh ovule_genome_sequence_files.txt /work/01_mapping/03_ovules/03_align/ /work/00_parent_reads/00_clean.reads/ /data/im767.v2/ Mguttatus.IM767.v2.fa

./rg_ovule_parents.sh ovule_genome_sequence_files.txt /work/01_mapping/03_ovules/03_align/ /hpc/home/picard.jar /data/im767.v2/ Mguttatus.IM767.v2.fa

Create scripts to call SNPs with bcftools.

Filter with bcftools: 1) Filter based on mapping quality; 2) remove repeats; 3) keep only snps

Concatenate all scaffold vcf files.

Use Lepmap3 to construct linkage map

https://sourceforge.net/p/lep-map3/wiki/LM3%20Home

Arguments for filter_vcf_for_lepmap3_ovule_cross.py

–invcf, type=str, required=True, help=“input VCF v 4.1 file” –ParentList, type=str, required=True, help=“input file with parent names, crossing direction (mom or dad, inbred or F1, and minimum and max depth for filtering” –SampleList, type=str, required=True, help=“tab-delimited file with sample Names” –out, type=str, required=True, help = Prefix for output file –minMapQ, type=int, required=True, help = Minimum phred-scaled Mapping Quality –filterMissingParents, type=str2bool, nargs=?, const=True, default=False, help=“Activate filtering out sites with missing parents.” –filterInbredParents, type=str2bool, nargs=?, const=True, default=False, help=“Activate filtering out heterozygous inbred parents.” –filterforhets, type=str2bool, nargs=?, const=True, default=False, help=“Activate filtering homozygous F1s.” –minparentdepth,type=str2bool, nargs=?, const=True, default=False, help = use depth info from parent file to filter by minimum depth –maxparentdepth, type=str2bool, nargs=?, const=True, default=False,help = use depth info from parent file to filter by maximum depth –FractionF2, type = int, default = 0, help = proportion offspring required to pass filter –minF2depth, type=int, required=True, default = 0, help = minimum depth of F2s to be included –missingList, type=str, required=True, help = File with fraction of missing data for each backcross offspring –MaxMissingData, type = float, required=True, help = ‘Maximum missing data allowed for backcross offspring; set to 0 to allow all samples’

cd 05_filter
module add VCFtools

PRE=ovule_f2s
OUT=ovule_f2s

vcftools --vcf $OUT.snps_only.vcf --missing-indv
mv out.imiss $OUT.frac_missing_data.txt
cd 05_filter

PRE=ovule_f2s
OUT=ovule_f2s.min60

python filter_vcf_for_lepmap3_ovule_cross.py --invcf $PRE.snps_only.vcf --ParentList ovule_parents.txt --SampleList ovule_samples.txt --out $OUT --minMapQ 20 --filterInbredParents True --filterforhets True --minparentdepth --maxparentdepth --missingList $PRE.frac_missing_data.txt --MaxMissingData 1 --filterMissingParents True --FractionF2 60 --minF2depth 4 > out

Arguments = Namespace(FractionF2=60, MaxMissingData=1.0, ParentList=‘ovule_parents.txt’, SampleList=‘ovule_samples.txt’, filterInbredParents=True, filterMissingParents=True, filterforhets=True, invcf=‘ovule_f2s.snps_only.vcf’, maxparentdepth=True, minF2depth=4, minMapQ=20, minparentdepth=True, missingList=‘ovule_f2s.frac_missing_data.txt’, out=‘ovule_f2s.min60’) — 6 minutes to complete task — There are 6721051 sites There are 6720200 variant SNP sites There are 6720200 sites with MQ >= 20 There are 2046079 sites with missing parents, 1106860 with het inbreds, and 4900739 with homozygous F1s There are 3301518 sites with low parental depth There are 520 sites with high parental depth There are 3090698 sites where grandparents have the same genotype There are 6570638 sites with fewer than 60 percent F2s There are 149 sites with more than 3 F2 genotypes There are 27602 final sites

line ovule_f2s.min60.max3.vcf 27638 ovule_f2s.min60.max3.vcf

cd 05_filter

PRE=ovule_f2s.min60.max3
OUT=ovule_f2s.min60.max3

module add VCFtools
vcftools --vcf $PRE.vcf --thin 150 --recode --recode-INFO-all --stdout > $OUT.thinned.vcf

vcftools --vcf $OUT.thinned.vcf --missing-indv
mv out.imiss $OUT.frac_missing_data.txt

VCFtools - 0.1.17 (C) Adam Auton and Anthony Marcketta 2009

Thin vcf with vcftools, requiring SNPs to be 150 bp apart.

Parameters as interpreted: –vcf ovule_f2s.min60.max3.vcf –recode-INFO-all –thin 150 –recode –stdout

Outputting VCF file… After filtering, kept 13376 out of a possible 27582 Sites Run Time = 2.00 seconds

(END)

Create pedigree file. DHRO22 is grandmother, GMR2 is grandfather, Pseudo F1s are created as mom and dad, then given as parents for F2s. Replicated sequences for DHRO22 and GMR2 were added into vcf to check phasing later on.

Distribution of missing data

##         0%        25%        50%        75%       100% 
## 0.00000000 0.01302707 0.01839115 0.02654005 0.98347800
suppressMessages(library(dplyr))
## Warning: package 'dplyr' was built under R version 4.2.3
missing <- read.table(here::here('data','ovule_f2s.min60.max3.frac_missing_data.txt'),h=T,sep='\t')
missing <- missing %>% filter(F_MISS<0.4)
f2s <- read.table(here::here('data','ovule_samples.txt'),h=F,sep='\t')
colnames(f2s) <- c('sample')
f2s <- f2s %>% filter(sample %in% missing$INDV)
write.table(f2s,here::here('data','ovule_samples_nomissing.txt'),quote=F,row.names=F,sep='\t',col.names=F)

Warning: Different number of grandparents (4 and 2) in family Fam Found 2 grandparents in family Fam Number of individuals = 90 Number of families = 1 Number of called markers = 13327 (13327 informative) Number of called Z/X markers = 0 chi^2 limits are 15.1357421875, 18.419921875, 21.107421875 Warning: Different number of grandparents (4 and 2) in family Fam Found 2 grandparents in family Fam Number of individuals = 90 Number of families = 1 Number of markers = 13327 Loading file Warning: Different number of grandparents (4 and 2) in family Fam Found 2 grandparents in family Fam Number of individuals = 90 Number of families = 1 File loaded with 11497 SNPs Number of individuals = 88 excluding grandparents Number of families = 1 computing pairwise LOD scores 123456789 done! done! number of LGs = 14 singles = 10 Number of LGs = 14, markers in LGs = 11487, singles = 10

## Warning: package 'htmltools' was built under R version 4.2.3
## Warning: package 'stringr' was built under R version 4.2.3

Best map is Lod 10 Theta 0.20 Filtered with data Tolerance = 0.0001 and identicalLimit=0.01

Getting phased genotypes for individuals. F2s were phased with grandparent phase (DHRO22 vs GMR2) and without filling in missing data. Assigns the DHRO22 homozygous genotype “2 2”, the GMR2 homozygous genotype “1 1”, and the heterozygous genotypes either “1 2” or “2 1”.

For chromosome 10 and the rest of the linkage groups:

I also checked the number of recombination events per F2 per linkage group. When an individual is outside the general distribution, it suggests well contamination.

Below is a plot of the distribution of recombination events per linkage group.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Create phased genotype file

Some colleagues have found that LepMAP3 preferentially phases missing data as heterozygotes. Replace phases inferred by Lepmap3 with missing data as determined by genotype in vcf.

cd 05_filter
module add VCFtools
vcftools --vcf ovule_f2s.min60.max3.thinned.vcf --bed ovule_sites.bed --recode --recode-INFO-all --out ovule_f2s.finalsites
mv ovule_f2s.finalsites.recode.vcf ovule_f2s.finalsites.vcf
bgzip  ovule_f2s.finalsites.vcf
tabix  ovule_f2s.finalsites.vcf.gz
bcftools view -S ovule_samples_nomissing.txt ovule_f2s.finalsites.vcf.gz | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' > ovule_f2s.finalsites.genotypes.txt

Distribution of heterozygosity

suppressMessages(library(dplyr))
source('het_distribution.R')

ovule.new.phase <- read.table(here('data','ovule_f2s_phased_w_missing.txt'),h=T,sep='\t')

homo1a <- '2 2'
homo1b <- 'AA'
homo2a <- '1 1'
homo2b <- 'BB'
het1a <- '1 2'
het1b <- 'AB'
het2a <- '2 1'

substitute_geno <- function(geno,homo1a,homo1b,homo2a,homo2b,het1a,het2a,het1b){
  df <- data.frame(lapply(geno,function(x){gsub(homo1a,homo1b,x)}))
  df <- data.frame(lapply(df,function(x){gsub(homo2a,homo2b,x)}))
  df <- data.frame(lapply(df,function(x){gsub(het1a,het1b,x)}))
  df <- data.frame(lapply(df,function(x){gsub(het2a,het1b,x)}))
#  df <- data.frame(lapply(df,function(x){gsub(NA,'NN',x)}))
  df$cM <- as.numeric(df$cM)
  return(df)
}

ovule.geno <- substitute_geno(ovule.new.phase,homo1a,homo1b,homo2a,homo2b,het1a,het2a,het1b)
df <- ovule.geno
for (col in names(df)) {
    # Replace NA with "NN" in each column
    df[is.na(df[[col]]), col] <- "NN"
}

d1 <- draw_het_distribution(df)

median(d1$fraction)
## [1] 0.5190613
mean(d1$fraction)
## [1] 0.5461994
min(d1$fraction)
## [1] 0.3048068
quantile(d1$fraction)
##        0%       25%       50%       75%      100% 
## 0.3048068 0.4700122 0.5190613 0.5861904 0.9347466
write.table(ovule.geno,here::here('data','ovule_sub_genotypes.txt'),quote=F,row.names=F,sep='\t',col.names=T)
highhet <- d1 %>% filter(fraction>=0.8) %>% select(sample)
ovule.geno <- df %>% select(-all_of(matches(highhet$sample)))
write.table(ovule.geno,here::here('data','ovule_sub_genotypes_missing_lowhet.txt'),quote=F,row.names=F,sep='\t')

rQTL

I combined the phased genotypes with the pedigree file and the linkagemap to create a file with linkage group, chromosome, position (bp), genetic distance (cM), and marker number.

I’m using a list of the mutant ovule phenotypes. “Mosaic” individuals are coded as “mutant” for the time being.

ovule.geno <- read.table('ovule_sub_genotypes_missing_lowhet.txt',h=T,sep='\t')
ovule.geno$position <- as.numeric(ovule.geno$position)
ovule.qtl <- ovule.geno %>%
  group_by(lg) %>%
  arrange(position) %>%
  filter(c(TRUE, diff(position) >= 25000))
ovule.qtl <- ovule.qtl[order(ovule.qtl$lg,ovule.qtl$cM),]
ovule.qtl <- ovule.qtl %>% mutate(newmarker=paste('D',lg,sep='')) %>% mutate(newmarker=paste(newmarker,'M',sep='')) %>% mutate(newmarker=paste(newmarker,position,sep='')) %>% mutate(marker=newmarker) %>% select(-newmarker,-position,-contig) %>% select(marker,lg,cM,everything())
ovule.qtl <- ovule.qtl[order(ovule.qtl$lg,ovule.qtl$cM),]

phenotypes <- read.table('FinalPhenotypeCalls.txt',h=T,sep='\t')
sample.names <- read.table('ovule_f2_samples.txt',h=F)
sample.names <- sample.names %>% select(-V1)  
colnames(sample.names) <- c('PlantID','F2')
phenotypes <- inner_join(sample.names,phenotypes,by='PlantID')
phenotypes <- phenotypes %>% select(-PlantID)
colnames(phenotypes)[2] <- 'phenotype'

homo1 <- 'AA'
homo2 <- 'BB'
het <- 'AB'


create_qtl_datasheet <- function(y,n,pheno,homo1,homo2,het){
  y <- t(y)
  x <- rownames(y)
  x <- x[-(1:3)]
  x <- c('id','','',x)
  y <- cbind(x,y)
  rownames(y) <- c()
  colnames(y)[1] <- c('id')
  y <- gsub(homo1,'A',y)
  y <- gsub(homo2,'B',y)
  y <- gsub(het,'H',y)
  y <- gsub('NN','N',y)
  colnames(y) <- c()
  y <- data.frame(y)
  names <- y[,1]
  names <- names[-(1:3)]
  names <- data.frame(names)
  colnames(names) <- 'F2'
  names <- inner_join(names,phenotypes,by='F2')
  names$phenotype <- gsub('MUT',0,names$phenotype)
  names$phenotype <- gsub('MOSAIC',0,names$phenotype)
  names$phenotype <- gsub('WT',1,names$phenotype)
  colnames(names)[1] <- 'id'
  names <- names %>% select(phenotype,id)
  outputnamegen <-paste0(n,"_quantgen.csv")
  outputnamephen <- paste0(n,'_rqtlphen.csv')
  write.table(x=y,outputnamegen,quote=F,row.names=F,col.names=F,sep=',')
  write.table(x=names,outputnamephen,quote=F,row.names=F,col.names=T,sep=',')
  return(y)
}

x <- create_qtl_datasheet(ovule.qtl,'ovule_f2s',phenotypes,homo1,homo2,het)
write.table(x,'ovule.qtl.txt',quote=F,row.names=F,sep='\t')